Abstract

Reversible logic has applications in various research areas including signal
processing, cryptography and quantum computation. In this paper, direct
NCT-based synthesis of a given k-cycle in a cycle-based synthesis scenario is
examined. To this end, a set of seven building blocks is proposed that reveals
the potential of direct synthesis of a given permutation to reduce both quantum
cost and average runtime. To synthesize a given large cycle, we propose a
decomposition algorithm to extract the suggested building blocks from the input
specification. Then, a synthesis method is introduced which uses the building
blocks and the decomposition algorithm. Finally, a hybrid synthesis framework
is suggested which uses the proposed cycle-based synthesis method in
conjunction with one of the recent NCT-based synthesis approaches which is
based on Reed-Muller (RM) spectra.

The time complexity and the effectiveness of the proposed synthesis approach
are analyzed in detail. Our analyses show that the proposed hybrid framework
leads to a better quantum cost in the worst-case scenario compared to the
previously presented methods. The proposed framework always converges and
typically synthesizes a given specification much faster than the available
synthesis algorithms. Besides, the quantum costs of benchmark functions are
improved by about 20% on average (55% in the best case).

1 Introduction 



Reversible computing deals with any computational process that is time-invertible, 
meaning that the process can also be computed backward through time. A nec- 
essary condition for reversibility is that the transition function applied to map 
inputs onto outputs works as a one-to-one function to have a unique output as- 
signment for each input pattern. Generally, conventional logic gates other than 
NOT are not reversible, as their inputs cannot be determined from the related 
outputs uniquely.

One of the motivations for research on reversible computing is that it offers
a potential way to improve the energy efficiency of computers beyond the
fundamental Landauer limit introduced in 1961 [1]. Landauer proved that using
conventional irreversible logic gates leads to at least $kT \times \ln 2$
energy dissipation per irreversible bit operation, regardless of the underlying
circuit, where k is Boltzmann's constant, and T is the temperature of the
environment. In 1973, Bennett stated that to avoid power dissipation in a
circuit, the circuit must be built from reversible gates [2]. This has made
reversible computing an attractive option for low-power design [3], [4].
Additionally, the field of reversible computing has received considerable
attention in quantum computing as each quantum gate is reversible in nature [5].

Among various open research problems related to the field of reversible
computing, reversible logic synthesis, defined as the ability to generate an
efficient circuit from a given arbitrary-size specification, is considered a
stepping-stone towards the realization of useful reversible hardware. As a
result, work on synthesis methods for reversible circuits has received
significant attention recently (for examples see [B], [7] and [S]). As loops
and fanout are not allowed in reversible circuits, and each gate must have the
same number of inputs and outputs with unique input/output assignments in the
transition function, mature irreversible synthesis algorithms cannot be
directly applied to reversible circuits.

To synthesize a given reversible specification, the authors of [9] proposed a
synthesis algorithm based on NOT, CNOT and Toffoli gates which represents
a given permutation as a product of pairs of disjoint transpositions (2-cycles)
and synthesizes each pair subsequently. A general permutation should be
decomposed into a set of 2-cycles to be synthesizable using their approach. In
this paper, a k-cycle-based synthesis method is proposed and analyzed in
detail. We show that direct synthesis of large cycles in a cycle-based
synthesis scenario can lead to a significant reduction in quantum cost. In
order to achieve this, several building blocks (BBs) and synthesis algorithms
are proposed to be used in the proposed k-cycle-based synthesis method. In
addition, a decomposition algorithm for the synthesis of a general large cycle
considering the suggested building blocks is introduced and analyzed. Based on
the characterization of the proposed synthesis method, a hybrid synthesis
framework, which uses the cycle-based synthesis approach in conjunction with
one of the recent methods [6], is also presented. Furthermore, the average-case
and worst-case quantum costs of the proposed synthesis framework are evaluated
experimentally and analyzed in detail.

The main contributions of this paper are as follows. 

• The analysis of the cycle-based synthesis approach and its usefulness in
synthesizing reversible functions with different characterizations,

• A k-cycle-based synthesis method with guaranteed convergence,

• A hybrid synthesis framework based on the proposed k-cycle-based synthesis
method together with the method of [6],

• The improved quantum cost in the worst-case scenario compared to the
previously presented methods,

• Better average quantum costs for available benchmark functions in the
NCT library,

• Improved average runtime compared to the present synthesis algorithms
with favorable synthesis costs.

The rest of this paper is organized as follows. In Section 2, basic concepts
are introduced. The proposed cycle-based synthesis method is presented in
Section 3, where the building blocks and their synthesis algorithms are
proposed in Subsection 3.1, the decomposition algorithm and the k-cycle-based
synthesis method are explained in Subsection 3.2, and the worst-case analysis
of the proposed cycle-based approach is discussed in Subsection 3.3.
Experimental results and the hybrid synthesis framework are presented in
Section 4 and, finally, Section 5 concludes the paper.



2 Preliminaries 

Let A be a set and define $f : A \to A$ as a one-to-one and onto transition
function. The function f is called a permutation function, as applying f to A
leads to a set with the same elements of A, probably in a different order.
If $A = \{1, 2, 3, \ldots, m\}$, there exist two elements $a_i$ and $a_j$
belonging to A such that $f(a_i) = a_j$. In addition, a k-cycle with length k
is denoted as $(a_1, a_2, \ldots, a_k)$, which means that $f(a_1) = a_2$,
$f(a_2) = a_3$, ..., and $f(a_k) = a_1$. A given k-cycle
$(a_1, a_2, \ldots, a_k)$ could be written in many different ways, such as
$(a_2, a_3, \ldots, a_k, a_1)$. A cycle with length 2 is called a transposition.

Cycles $c_1$ and $c_2$ are called disjoint if they have no common members,
i.e., $\forall a_i \in c_1,\ a_i \notin c_2$ and vice versa. Any permutation
can be written uniquely, except for the order, as a product of disjoint cycles.
If two cycles $c_1$ and $c_2$ are disjoint, they commute, i.e.,
$c_1 c_2 = c_2 c_1$. In addition, a cycle may be written in different ways as a
product of transpositions, using different numbers of transpositions. A cycle
(or a permutation) is called even if it can be written as an even number of
transpositions. A similar definition is introduced for an odd cycle. Although
there may be many ways to decompose a given cycle into a set of transpositions,
the parity of the number of transpositions used stays the same, i.e., all
resulting decompositions have the same even/odd number of transpositions. A
k-cycle is odd (even) if k is even (odd).
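
The definitions above can be illustrated with a minimal Python sketch. The
function names `disjoint_cycles` and `is_even` and the dictionary encoding of
a permutation are assumptions made for illustration only; they are not part of
the proposed method. The sketch extracts the nontrivial disjoint cycles of a
permutation and derives its parity from the fact that a k-cycle is a product
of k - 1 transpositions.

```python
from typing import Dict, List, Tuple


def disjoint_cycles(perm: Dict[int, int]) -> List[Tuple[int, ...]]:
    """Return the nontrivial disjoint cycles of a permutation (as a mapping)."""
    seen = set()
    cycles = []
    for start in sorted(perm):
        if start in seen:
            continue
        cycle = [start]
        seen.add(start)
        nxt = perm[start]
        while nxt != start:          # follow f until we return to the start
            cycle.append(nxt)
            seen.add(nxt)
            nxt = perm[nxt]
        if len(cycle) > 1:           # fixed points are not reported
            cycles.append(tuple(cycle))
    return cycles


def is_even(perm: Dict[int, int]) -> bool:
    """A k-cycle needs k - 1 transpositions; sum them and test the parity."""
    return sum(len(c) - 1 for c in disjoint_cycles(perm)) % 2 == 0


# Hypothetical example: the permutation (1, 2, 3)(4, 5) with fixed point 6.
perm = {1: 2, 2: 3, 3: 1, 4: 5, 5: 4, 6: 6}
print(disjoint_cycles(perm))  # [(1, 2, 3), (4, 5)]
print(is_even(perm))          # False: 2 + 1 = 3 transpositions
```

In this example the 3-cycle is even and the transposition is odd, so their
product is odd.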

An n-input, n-output, fully specified Boolean function is reversible if it maps
each input pattern to a unique output pattern. In this paper, n is particularly
used to refer to the number of inputs/outputs in a circuit. A gate is called
reversible if it realizes a reversible function. A generalized Toffoli gate
$C^m\mathrm{NOT}(x_1, x_2, \ldots, x_{m+1})$ passes the first m lines
unchanged. These lines are referred to as control lines. This gate flips the
(m+1)th line if and only if the control lines are all one. Therefore, the
generalized Toffoli gate works as follows: $x_{i(out)} = x_i$ $(i < m+1)$,
$x_{m+1(out)} = x_1 x_2 \cdots x_m \oplus x_{m+1}$. For m = 0 and m = 1, the
gates are called NOT (N) and CNOT (C), respectively. For m = 2, the gate
is called $C^2\mathrm{NOT}$ or Toffoli (T). These three gates compose the
universal NCT library and are used in quantum computation frequently [5].
Outputs that are not required in the function specification are considered as
garbage or auxiliary bits. The number of elementary gates required for
simulating a given gate is called quantum cost.

Figure 1: Construction of k Toffoli gates with common controls
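
As a minimal sketch of the gate semantics described above, the following
Python fragment models a generalized Toffoli gate as a map on n-bit patterns
stored as integers (bit 0 is the least significant bit). The helper names
`apply_gate` and `gate_permutation` are hypothetical and chosen for
illustration; an empty control list gives a NOT gate and a single control gives
a CNOT gate.

```python
from typing import Dict, Sequence


def apply_gate(x: int, controls: Sequence[int], target: int) -> int:
    """Flip bit `target` of x if and only if all control bits of x are 1."""
    if all((x >> c) & 1 for c in controls):
        return x ^ (1 << target)
    return x


def gate_permutation(n: int, controls: Sequence[int], target: int) -> Dict[int, int]:
    """Return the permutation of {0, ..., 2^n - 1} realized by one gate."""
    return {x: apply_gate(x, controls, target) for x in range(2 ** n)}


# Toffoli gate on n = 3 lines: controls on bits 2 and 1, target on bit 0.
n = 3
toffoli = gate_permutation(n, controls=[2, 1], target=0)
moved = sorted(x for x, y in toffoli.items() if x != y)
print(moved)  # [6, 7]: the gate is the transposition (6, 7)
```

Because each gate only flips one bit under a fixed condition on the other
bits, every gate is its own inverse, and a circuit is inverted by applying its
gates in the reverse order.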

It has been shown that for $n \ge 5$ and
$m \in \{3, 4, \ldots, \lceil n/2 \rceil\}$, a $C^m\mathrm{NOT}$ gate can be
simulated by a linear-size circuit which contains $12m - 22$ elementary
gates. In addition, for $n \ge 7$, a $C^{n-2}\mathrm{NOT}$ gate can be
simulated by $24n - 88$ elementary gates with no auxiliary bits [7]. On the
other hand, a $C^{n-1}\mathrm{NOT}$ gate can be simulated with an exponential
cost $2^n - 3$ if no garbage line is available [10]. To avoid the exponential
size and the need for a large number of elementary gates, several researchers
used an extra garbage line for an efficient simulation of the
$C^{n-1}\mathrm{NOT}$ gate (e.g., [B]). Generally, the number of available
bits is very restricted in today's reversible and quantum implementations.
Therefore, for two circuits with equal linear costs, the one without a garbage
line is preferred. The implementation of k Toffoli gates with common controls
can be done by $2k + 3$ elementary gates as illustrated in Fig. 1 [12]. Note
that a Toffoli gate has the cost of 5 whereas NOT and CNOT gates have unit
costs.

The authors of [9] proposed an NCT-based synthesis method which applies
NOT, Toffoli, CNOT and Toffoli gates in order (the T|C|T|N synthesis method)
to synthesize a given permutation. For the last Toffoli part, the authors
proposed a synthesis algorithm that maps distinct a, b, c and d
($a, b, c, d \ne 0, 2^i$, so that each has at least two ones in its binary
representation) to $2^n - 4$, $2^n - 3$, $2^n - 2$ and $2^n - 1$ using a
circuit called $\pi$ by at most $5n - 2$ Toffoli gates. Then, the permutation
$(2^n - 4, 2^n - 3)(2^n - 2, 2^n - 1)$ is implemented by a circuit,
$\kappa_0$, using $8(n - 5)$ Toffoli gates and finally, the reversed $\pi$
circuit, i.e., $\pi^{-1}$, is applied to transform $2^n - 4$, $2^n - 3$,
$2^n - 2$ and $2^n - 1$ into a, b, c and d, respectively. Therefore, the
$\pi\kappa_0\pi^{-1}$ circuit implements the permutation (a, b)(c, d)
where $a, b, c, d \ne 0, 2^i$ by at most $18n - 44$ Toffoli gates.

In contrast, a given k-cycle $f = (x_0, x_1, x_2, \ldots, x_k)$ is decomposed
into a set of transpositions in [5] by using the decomposition pattern
$f = (x_0, x_1)(x_{k-1}, x_k)(x_0, x_2, x_3, \ldots, x_{k-1})$, recursively.
Subsequently, each pair of the transpositions is implemented using the
$\pi\kappa_0\pi^{-1}$ circuit. The proposed approach leads to at
most n NOT gates, CNOT gates and $3(2^n + n + 1)(3n - 7)$ Toffoli gates [9].
An extension of [9] was suggested in [13] which produced better quantum cost
by applying the unit-cost NOT and CNOT gates instead of using Toffoli gates
with cost 5 in many situations.

In this paper, the $\pi\kappa_0\pi^{-1}$ circuit is improved by a
k-cycle-based synthesis method. For the rest of this paper, we use the same
notations as [9] for the $\kappa_0$, $\pi$, and $\pi^{-1}$ circuits. In all
figures, the (n-1)th bit represents the most significant bit (MSB) and is
shown as the top line in the circuit representations. Similarly, the 0th bit
represents the least significant bit (LSB) and is shown as the bottom line in
the circuit representations.

3 k-Cycle-Based Synthesis Method
3.1 Building Blocks 

In this subsection, direct synthesis algorithms for seven suggested building
blocks (i.e., a pair of 2-cycles, a single 3-cycle, a pair of 3-cycles, a
single 5-cycle, a pair of 5-cycles, a single 2-cycle (4-cycle) followed by a
single 4-cycle (2-cycle), and a pair of 4-cycles) are introduced and evaluated.
Consider a given 5-cycle $f = (a_1, a_2, a_3, a_4, a_5)$ defined in a 7-bit
circuit. Assume that $a_1$, $a_2$, $a_3$, $a_4$, and $a_5$ are neither 0 nor
$2^i$, so that each has at least two ones in its binary representation.
Applying the decomposition method of [9] leads to the transpositions
$(a_1, a_2)(a_3, a_4)(a_1, a_3)(a_1, a_5)$, which could be implemented by at
most $3 \times (18n - 44) = 54n - 132$ Toffoli gates with cost $270n - 660$.
However, we will show that a direct 5-cycle implementation of f reduces the
total quantum cost to at most $60n - 130$.

The proposed synthesis method treats the zero and $2^i$ terms differently from
the remaining terms. The first group is handled in a pre-process stage similar
to the method presented in [9]. For an arbitrary k-cycle
$(a_1, a_2, \ldots, a_k)$ in the second group, it can be assumed that
$a_1, a_2, \ldots, a_k \ne 0, 2^i$ and $a_1 \ne a_2 \ne \cdots \ne a_k$.
Throughout this paper, the binary representation is used where CNOT
and Toffoli control bits are shown in boldface and the rightmost bit is
numbered as the 0th (least significant) bit. In order to use the decomposition
algorithm proposed in [7], we assume that $n \ge 7$.

Lemma 3.1 The $\kappa_0(2,2)$ circuit (Fig. 2) creates a pair of 2-cycles
$(2^n - 4, 2^n - 3)(2^n - 2, 2^n - 1)$ by $24n - 88$ elementary gates.

Proof Lemma 20 of [9] proves the correspondence between the $\kappa_0(2,2)$
circuit and the above cycles. As for the cost, it can be obtained by applying
the results of [7].

Figure 2: The $\kappa_0(2,2)$ circuit

Figure 3: The circuit of Theorem 3.1

Figure 4: Synthesis of an arbitrary pair of 2-cycles (a, b) (c, d)



According to Lemma 3.1, the $\kappa_0(2,2)$ circuit implements the particular
pair of 2-cycles $(2^n - 4, 2^n - 3)(2^n - 2, 2^n - 1)$. In order to implement
an arbitrary pair (a, b) (c, d), the circuit is divided into five parts as
follows. First, the terms a, b, c and d are changed to 4, 1, 2 and
$2^{n-1} + 3$, respectively. Note that the first three terms have only one 1
in their binary representations. As shown in the following theorem, this
characterization is used during the synthesis of a pair of 2-cycles. Second, a
circuit is applied to change 4, 1, 2 and $2^{n-1} + 3$ to $2^n - 4$,
$2^n - 3$, $2^n - 2$ and $2^n - 1$ (i.e., the terms used in the
$\kappa_0(2,2)$ circuit), correspondingly. Afterward, the $\kappa_0(2,2)$
circuit is used, which changes $2^n - 4$, $2^n - 3$, $2^n - 2$ and $2^n - 1$
to $2^n - 3$, $2^n - 4$, $2^n - 1$ and $2^n - 2$, respectively. Applying the
second and the first sub-circuits in the reverse order puts unwanted terms
(i.e., all terms except a, b, c and d) back to their original locations and
implements the given pair of 2-cycles (a, b) (c, d). Fig. 4 demonstrates the
complete synthesis scenario. Theorem 3.1 discusses the synthesis of an
arbitrary pair of 2-cycles in more detail. The synthesis procedures for other
cycles are similar to the one explained here, as shown later.

Theorem 3.1 ($Syn_{2,2}$ method): An arbitrary pair of 2-cycles (a, b) (c, d)
can be simulated by at most $34n - 64$ elementary gates.

Proof Since a, b, c and d are neither 0 nor $2^i$, they should have at least
two ones in their binary representations. Assume that the $c_1$-th bit of a is
1. One can use at most one CNOT gate whose control is on $c_1$ to set the 2nd
bit of a to 1. Subsequently, by using at most n-1 CNOT gates whose controls
are on the 2nd bit, other bits can be set to 0 for converting a to 4 (i.e.,
...0100). Assume that after applying these gates, b, c and d are changed to
b', c', and d', respectively. Since b' should have at least one 1, namely at
position $c_2$ ($c_2 \ne 2$), b' can be converted to 1 (i.e., ...01) by at
most n CNOT gates using a similar approach. Then, c' and d' may be changed to
new numbers c'' and d'', respectively, without changing 4.

Subsequently, c'' can be converted to 2 (i.e., ...010) by at most one Toffoli
gate and n-1 CNOT gates with no effects on 4 and 1. Finally, the last term
can be converted to $2^{n-1} + 3$ (i.e., 10...011) by at most one Toffoli gate
and n-1 CNOT gates, again with no effect on the previous terms. Therefore, at
most $4n + 8$ elementary gates are required to transform a, b, c and d into 4,
1, 2 and $2^{n-1} + 3$, respectively. Now, the circuit shown in Fig. 3 should
be applied to change 4, 1, 2 and $2^{n-1} + 3$ to the terms used in the
$\kappa_0(2,2)$ circuit (Lemma 3.1). Considering the applied gates (at most
$5n + 12$ elementary gates), the terms a, b, c and d are changed to $2^n - 4$,
$2^n - 3$, $2^n - 2$ and $2^n - 1$, respectively (i.e., the $\pi_{2,2}$
circuit). Now, by using the $\kappa_0(2,2)$ circuit with the cost of
$24n - 88$, the pair of 2-cycles $(2^n - 4, 2^n - 3)(2^n - 2, 2^n - 1)$ is
implemented. Applying the $\pi_{2,2}^{-1}$ circuit changes $2^n - 4$,
$2^n - 3$, $2^n - 2$ and $2^n - 1$ to a, b, c, and d, respectively. In
addition, the circuit $\pi_{2,2}^{-1}$ puts other unwanted terms back to their
original locations. Therefore, by at most
$2(5n + 12) + (24n - 88) = 34n - 64$ elementary gates, the pair of 2-cycles
(a, b) (c, d) can be implemented.

Example 3.1 Assume that the pair of 2-cycles (5, 3) (9, 67) should be
implemented in a circuit over 7 bits (i.e., n = 7). According to the proof of
Theorem 3.1, the term 5 should be transformed to 4 by a CNOT gate which has no
effect on other terms. Similarly, 3 is transformed to 1 by a CNOT gate which
changes the term 9 to 11 and 67 to 65. Then, 11 is transformed to 2 by two
CNOT gates with no effect on other terms. Finally, 65 is transformed to 67 by
a CNOT gate. See the first sub-circuit in Fig. 5 for more details. Now, the
circuit shown in Fig. 3 should be applied, followed by the $\kappa_0(2,2)$
circuit. Finally, as illustrated in the last two sub-circuits of Fig. 5, the
above gates (except the $\kappa_0(2,2)$ circuit) should be used in the reverse
order to construct the complete circuit. In Fig. 5, the results of applying
all gates on the term 67 are also represented by gray squares where only
values 1 are shown for the sake of simplicity. As can be seen, applying all
gates changes 67 to 9.



Lemma 3.2 The $\kappa_0(3)$ circuit (Fig. 6) creates the 3-cycle
$(2^n - 2^{k-1} - 1,\ 2^n - 1,\ 2^{n-1} - 1)$ by $24n - 88$ elementary gates
where $k = \lceil n/2 \rceil$.

Proof As shown in Fig. 6, the gates
$C^m\mathrm{NOT}(n-1, n-2, \ldots, k, k-1)$,
$C^k\mathrm{NOT}(0, 1, 2, \ldots, k-1, n-1)$,
$C^m\mathrm{NOT}(n-1, n-2, \ldots, k, k-1)$,
$C^k\mathrm{NOT}(0, 1, 2, \ldots, k-1, n-1)$ are applied consecutively in the
$\kappa_0(3)$ circuit, where $m = n - k$. After applying the first
$C^m\mathrm{NOT}$ gate, the locations of $2^k$ minterms (denoted as
$\Sigma_1 = \{2^n - 2^k, 2^n - 2^k + 1, \ldots, 2^n - 1\}$) are changed.
Particularly, $2^n - 2^{k-1} - 1$ (i.e., 1...101...1 where the 0 is at the
(k-1)th position) $\in \Sigma_1$ is changed to $2^n - 1$ (i.e., 1...1)
($\in \Sigma_1$). By applying the $C^k\mathrm{NOT}$, the locations of $2^m$
minterms (denoted as
$\Sigma_2 = \{0 \times 2^k + 2^k - 1,\ 1 \times 2^k + 2^k - 1,\ \ldots,\ (2^m - 1) \times 2^k + 2^k - 1 = 2^n - 1\}$)
are changed ($2^n - 1 \in \Sigma_1 \cap \Sigma_2$). Among them, $2^n - 1$ is
exchanged with $2^{n-1} - 1$ (i.e., 01...1) $\in \Sigma_2$. Applying the third
$C^m\mathrm{NOT}$ gate puts all $\Sigma_1$ minterms at their right locations
except $2^n - 2^{k-1} - 1$ and also changes $2^n - 1$ to $2^n - 2^{k-1} - 1$.
Finally, the last $C^k\mathrm{NOT}$ gate corrects the locations of all
$\Sigma_2$ members except $2^{n-1} - 1$ and $2^n - 1$. Considering all the
exchanges, $2^n - 2^{k-1} - 1$ is changed to $2^n - 1$, $2^n - 1$ is changed
to $2^{n-1} - 1$, and $2^{n-1} - 1$ is changed to $2^n - 2^{k-1} - 1$.

For the second part of the lemma, note that the first and the third gates
shown in Fig. 6 can be implemented by
$2 \times (12 \times (n - \lceil n/2 \rceil) - 22)$ elementary gates.
Similarly, the second and the fourth gates can be implemented by
$2 \times (12 \times \lceil n/2 \rceil - 22)$ gates. Therefore, $\kappa_0(3)$
is implemented by cost $24n - 88$.

Figure 5: The circuit of Example 3.1

Figure 6: The $\kappa_0(3)$ circuit

Figure 7: The circuit of Theorem 3.2
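
The cycle claimed in Lemma 3.2 can be checked by exhaustive simulation for a
small n. The following Python sketch is only an illustration under stated
assumptions: the helper names `apply_gate` and `kappa0_3` are hypothetical,
minterms are stored as integers with bit 0 as the LSB, and the last argument
of each gate in the proof is taken as its target line.

```python
from math import ceil


def apply_gate(x, controls, target):
    """Generalized Toffoli gate: flip `target` iff all control bits are 1."""
    if all((x >> c) & 1 for c in controls):
        return x ^ (1 << target)
    return x


def kappa0_3(x, n):
    """Apply the four gates listed in the proof of Lemma 3.2, in order."""
    k = ceil(n / 2)
    g1 = (list(range(k, n)), k - 1)   # controls n-1, ..., k; target k-1
    g2 = (list(range(0, k)), n - 1)   # controls 0, ..., k-1; target n-1
    for controls, target in (g1, g2, g1, g2):
        x = apply_gate(x, controls, target)
    return x


n = 7
k = ceil(n / 2)
# Collect every minterm that is moved by the circuit.
moved = {x: kappa0_3(x, n) for x in range(2 ** n) if kappa0_3(x, n) != x}
print(moved)  # {63: 119, 119: 127, 127: 63}
```

For n = 7 and k = 4, the only moved minterms are 119, 127 and 63, i.e.,
$2^n - 2^{k-1} - 1$, $2^n - 1$ and $2^{n-1} - 1$, and they are mapped as
119 to 127, 127 to 63 and 63 to 119, which matches the 3-cycle of the lemma.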

Theorem 3.2 ($Syn_3$ method): An arbitrary 3-cycle (a, b, c) requires at most
$32n - 82$ elementary gates to be implemented.

Proof Since a, b, and c are neither 0 nor $2^i$, they should have at least two
ones in their binary representations. One can use at most n CNOT gates to
transform a to $2^{n-1}$ (i.e., 10...0). After applying these gates, assume
that b and c are changed to b' and c', respectively. By using a similar
approach, c' can be converted to $2^{n-2}$ (i.e., 010...0) by n CNOT gates
that may change b' to a new number b'' without changing $2^{n-1}$. Finally,
converting b'' to $2^{n-1} + 2^{k-1}$ (i.e., 10...010...0 where the second 1
is at the (k-1)th position) can be done by one Toffoli and n-1 CNOT gates
with no effects on the previous $2^{n-1}$ and $2^{n-2}$ terms. Therefore, by
at most $3n + 4$ elementary gates, a, b and c are transformed into $2^{n-1}$,
$2^{n-1} + 2^{k-1}$, and $2^{n-2}$, respectively. Now, the circuit shown in
Fig. 7 should be applied to change the recent terms to the terms used in the
$\kappa_0(3)$ circuit.

Considering the applied gates (at most $4n + 3$ elementary gates), the terms
a, b, and c are changed to $2^n - 2^{k-1} - 1$, $2^n - 1$, and $2^{n-1} - 1$,
respectively (i.e., the $\pi_3$ circuit). By using the $\kappa_0(3)$ circuit
with cost $24n - 88$, the 3-cycle
$(2^n - 2^{k-1} - 1,\ 2^n - 1,\ 2^{n-1} - 1)$ is implemented. Applying the
$\pi_3^{-1}$ circuit changes $2^n - 2^{k-1} - 1$, $2^n - 1$, and
$2^{n-1} - 1$ to a, b, and c, respectively. Therefore, by at most $32n - 82$
elementary gates, the 3-cycle (a, b, c) can be implemented. It is worth noting
that a single 3-cycle can be a BB by itself because it is even. As will be
shown later, the same is true for a single 5-cycle.

Figure 8: The $\kappa_0(3,3)$ circuit, $k = \lceil n/2 \rceil$

Figure 9: The circuit of Theorem 3.3



Lemma 3.3 The $\kappa_0(3,3)$ circuit (Fig. 8) implements the pair of 3-cycles
$(2^n - 2^{k-1} - 1,\ 2^n - 1,\ 2^{n-1} - 1)$
$(2^n - 2^{k-1} - 2,\ 2^n - 2,\ 2^{n-1} - 2)$ by $24n - 112$ elementary gates
where $k = \lceil n/2 \rceil$.


Proof It can be verified that the $\kappa_0(3,3)$ circuit differs from the
$\kappa_0(3)$ circuit in its least significant bit (i.e., the 0th bit), which
leads to two 3-cycles. The first and the third gates need $12n - 44$
elementary gates. The second and the fourth gates need $12n - 68$ elementary
gates. Therefore, $\kappa_0(3,3)$ can be implemented by the cost of
$24n - 112$ gates.



Theorem 3.3 ($Syn_{3,3}$ method): The implementation of an arbitrary pair of
3-cycles (a, b, c) (d, e, f) requires at most $38n - 46$ elementary gates.

Proof Use at most $6n + 16$ elementary gates to convert a to $2^{n-1}$ (i.e.,
10...0), b to $2^{k-1}$ (i.e., 0...010...0 where the 1 is at the (k-1)th
position), c to 1 (i.e., ...01), d to 2 (i.e., ...010), e to $2^{n-2}$ (i.e.,
010...0), and f to $2^{n-2} + 6$ (i.e., 010...0110), sequentially. Therefore,
the terms a, b, c, d, e, and f are changed to $2^{n-1}$, $2^{k-1}$, 1, 2,
$2^{n-2}$, and $2^{n-2} + 6$, respectively. Note that the terms a and b can be
implemented by only CNOT gates. For each of the other terms, at most one
Toffoli and n-1 CNOT gates should be applied. Now, apply the circuit shown in
Fig. 9. After applying at most $7n + 33$ elementary gates, a, b, c, d, e, and
f are transformed into $2^n - 2^{k-1} - 2$, $2^n - 2$, $2^{n-1} - 2$,
$2^n - 2^{k-1} - 1$, $2^n - 1$, and $2^{n-1} - 1$, respectively (i.e., the
$\pi_{3,3}$ circuit). By applying $\kappa_0(3,3)$ and the reversed
$\pi_{3,3}$ circuit, $38n - 46$ elementary gates are used and (a, b, c)
(d, e, f) is implemented.

Figure 10: The $\kappa_0(4,2)$ circuit, $k = \lceil n/2 \rceil$

Figure 11: The circuit of Theorem 3.4



Lemma 3.4 The $\kappa_0(4,2)$ circuit (Fig. 10-a) implements the pair
$(2^n - 4,\ 2^n - 1,\ 2^n - 3,\ 2^n - 2)$ $(2^{n-1} - 2,\ 2^{n-1} - 1)$ by
$36n - 180$ elementary gates.

Proof The first $C^{n-2}\mathrm{NOT}(n-1, n-2, \ldots, 2, 1)$ gate shown in
Fig. 10-a changes $2^n - 4$, $2^n - 3$, $2^n - 2$, and $2^n - 1$ to
$2^n - 2$, $2^n - 1$, $2^n - 4$, and $2^n - 3$, respectively. The second
$C^{n-2}\mathrm{NOT}(n-2, \ldots, 2, 1, 0)$ changes $2^n - 2$, $2^n - 1$,
$2^{n-1} - 2$ and $2^{n-1} - 1$ to $2^n - 1$, $2^n - 2$, $2^{n-1} - 1$ and
$2^{n-1} - 2$, respectively. Considering the gates sequentially leads to the
implementation of $\kappa_0(4,2)$. The circuit in Fig. 10-b can be obtained by
applying Lemma 7.3 of [10] on each $C^{n-2}\mathrm{NOT}$ gate of Fig. 10-a and
canceling the resulting redundant gates. The total number of $36n - 180$
elementary gates can be obtained by summing the costs of the gates in
Fig. 10-b.
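
The two-gate construction in the proof of Lemma 3.4 can also be checked by
simulation. The sketch below is an illustration only, with hypothetical helper
names (`apply_gate`, `kappa0_4_2`, `cycles`); it applies the two
$C^{n-2}\mathrm{NOT}$ gates of Fig. 10-a in order for n = 7 and extracts the
resulting disjoint cycles.

```python
def apply_gate(x, controls, target):
    """Generalized Toffoli gate: flip `target` iff all control bits are 1."""
    if all((x >> c) & 1 for c in controls):
        return x ^ (1 << target)
    return x


def kappa0_4_2(x, n):
    """Apply the two gates from the proof of Lemma 3.4 (bit 0 is the LSB)."""
    x = apply_gate(x, range(2, n), 1)      # controls n-1, ..., 2; target 1
    x = apply_gate(x, range(1, n - 1), 0)  # controls n-2, ..., 1; target 0
    return x


def cycles(perm):
    """Nontrivial disjoint cycles of a permutation given as a dict."""
    seen, result = set(), []
    for start in sorted(perm):
        if start in seen or perm[start] == start:
            continue
        cycle, nxt = [start], perm[start]
        seen.add(start)
        while nxt != start:
            cycle.append(nxt)
            seen.add(nxt)
            nxt = perm[nxt]
        result.append(tuple(cycle))
    return result


n = 7
perm = {x: kappa0_4_2(x, n) for x in range(2 ** n)}
print(cycles(perm))  # [(62, 63), (124, 127, 125, 126)]
```

With n = 7, the cycles found are (62, 63) and (124, 127, 125, 126), i.e.,
$(2^{n-1} - 2, 2^{n-1} - 1)$ and $(2^n - 4, 2^n - 1, 2^n - 3, 2^n - 2)$, in
agreement with the lemma.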



Theorem 3.4 ($Syn_{4,2}$ method): An arbitrary pair (a, b, c, d) (e, f) can be
implemented by at most $50n - 122$ elementary gates.

Proof Use at most $6n + 16$ elementary gates to convert a to 4 (i.e.,
...0100), c to 1 (i.e., ...01), d to 2 (i.e., ...010), e to $2^{n-2}$ (i.e.,
010...0), f to $2^{n-3}$ (i.e., 0010...0), and b to $2^{n-1} + 3$ (i.e.,
10...011), sequentially. Note that the terms a and c can be implemented by
only CNOT gates. For each of the other terms, at most one Toffoli and n-1
CNOT gates should be applied. Now, apply the circuit shown in Fig. 11. After
applying at most $7n + 29$ elementary gates, the terms a, b, c, d, e, and f
are changed to $2^n - 4$, $2^n - 1$, $2^n - 3$, $2^n - 2$, $2^{n-1} - 2$, and
$2^{n-1} - 1$, respectively (the $\pi_{4,2}$ circuit). Then, apply the
$\kappa_0(4,2)$ and the reversed $\pi_{4,2}$ circuit (i.e.,
$\pi_{4,2}^{-1}$) to complete the implementation of (a, b, c, d) (e, f) by at
most $50n - 122$ elementary gates.

Figure 12: The $\kappa_0(4,4)$ circuit

Figure 13: The circuit of Theorem 3.5



Lemma 3.5 The $\kappa_0(4,4)$ circuit (Fig. 12-a) implements
$(2^n - 8,\ 2^n - 2,\ 2^n - 6,\ 2^n - 4)$
$(2^n - 7,\ 2^n - 1,\ 2^n - 5,\ 2^n - 3)$ by cost $36n - 228$.



Proof Consider Fig. 12-a. The first
$C^{n-3}\mathrm{NOT}(n-1, n-2, \ldots, 3, 2)$ gate changes $2^n - 8$,
$2^n - 7$, $2^n - 6$ and $2^n - 5$ to $2^n - 4$, $2^n - 3$, $2^n - 2$ and
$2^n - 1$, respectively. The second
$C^{n-2}\mathrm{NOT}(n-1, n-2, \ldots, 2, 1)$ gate changes $2^n - 4$,
$2^n - 3$, $2^n - 2$, and $2^n - 1$ to $2^n - 2$, $2^n - 1$, $2^n - 4$, and
$2^n - 3$, respectively. Considering the gates sequentially leads to the
implementation of the cycle. Applying Lemma 7.3 of [10] on each gate shown in
Fig. 12-a and canceling the resulting redundant gates transforms Fig. 12-a to
Fig. 12-b. The total number of $36n - 228$ elementary gates can be obtained by
summing the costs of the gates shown in Fig. 12-b.



Theorem 3.5 ($Syn_{4,4}$ method): An arbitrary pair (a, b, c, d) (e, f, g, h)
can be implemented by at most $56n - 126$ elementary gates.

Proof Use at most $9n + 22$ elementary gates to sequentially convert a to 8
(i.e., ...01000), c to 2 (i.e., ...010), d to 4 (i.e., ...0100), e to 1 (i.e.,
...01), f to $2^{n-2}$ (i.e., 010...0), g to $2^{n-1}$ (i.e., 10...0), h to
$2^{n-3}$ (i.e., 0010...0) and b to 14 (i.e., ...01110). Note that a and c can
be transformed to 8 and 2 by only CNOT gates, respectively. In addition, for
each term d, e, f, g, and h at most one Toffoli and n-1 CNOT gates should be
used. For the last term b, at most two Toffoli gates should be used to set the
2nd and 3rd bits to 1. Then, at most n-2 Toffoli gates should be applied to
set the 1st bit to 1 and the i-th bit to 0 where $0 \le i \le n-1$,
$i \ne 1, 2, 3$. The n-2 Toffoli gates can be implemented by cost
$2(n-2) + 3$ (see Fig. 1) since all Toffoli gates use the same control lines
(i.e., the 2nd and 3rd bits). Note that for n > 8, the term b can also be
implemented by at most one Toffoli and n-1 CNOT gates. Now, apply the circuit
shown in Fig. 13. After applying at most $10n + 51$ elementary gates, a, b, c,
d, e, f, g, and h are changed to $2^n - 8$, $2^n - 2$, $2^n - 6$, $2^n - 4$,
$2^n - 7$, $2^n - 1$, $2^n - 5$, and $2^n - 3$, respectively ($\pi_{4,4}$).
Then, apply $\kappa_0(4,4)$ and $\pi_{4,4}^{-1}$ to complete the
implementation of (a, b, c, d) (e, f, g, h) by at most $56n - 126$ elementary
gates.

Figure 14: The $\kappa_0(5)$ circuit

Figure 15: The circuit of Theorem 3.6



Lemma 3.6 The $\kappa_0(5)$ circuit (Fig. 14) implements the 5-cycle
$(2^{n-2} - 1,\ 2^n - 1,\ 2^n - 2^{n-2} - 1,\ 2^{n-1} - 1,\ 2^n - 2^{n-3} - 1)$
with cost $48n - 166$.

Proof As illustrated in Fig. 14, four gates $T(n-1, n-2, n-3)$,
$C^{n-2}\mathrm{NOT}(0, \ldots, n-3, n-1)$, $T(n-1, n-2, n-3)$,
$C^{n-2}\mathrm{NOT}(0, \ldots, n-3, n-2)$ are applied sequentially. After
applying the first Toffoli gate, the locations of $2^{n-2}$ minterms (i.e.,
$\Sigma_1 = \{2^n - 2^{n-2},\ 2^n - 2^{n-2} + 1,\ \ldots,\ 2^n - 1\}$) are
changed. Mainly, $2^n - 2^{n-3} - 1$ (i.e., 1101...1) $\in \Sigma_1$ is
changed to $2^n - 1$ ($\in \Sigma_1$). After the second
$C^{n-2}\mathrm{NOT}$, the locations of 4 minterms (denoted as
$\Sigma_2 = \{2^{n-2} - 1,\ 2^{n-1} - 1,\ 2^n - 2^{n-2} - 1,\ 2^n - 1\}$) are
changed (where $2^n - 1 \in \Sigma_1 \cap \Sigma_2$). Among them, $2^n - 1$ is
changed to $2^{n-1} - 1 \in \Sigma_2$ and $2^{n-1} - 1$ is changed to
$2^n - 1$. Applying the third Toffoli gate puts all $\Sigma_1$ minterms at
their right locations except $2^n - 2^{n-3} - 1$. In addition, it changes
$2^n - 1$ to $2^n - 2^{n-3} - 1$. Finally, the last $C^{n-2}\mathrm{NOT}$ gate
changes the locations of four minterms as $2^{n-1} - 1$ to $2^{n-2} - 1$,
$2^n - 1$ to $2^n - 2^{n-2} - 1$, $2^n - 2^{n-2} - 1$ to $2^n - 1$, and
$2^{n-2} - 1$ to $2^{n-1} - 1$. Considering all minterm exchanges, it can be
verified that the 5-cycle $\kappa_0(5)$ is implemented by the circuit of
Fig. 14. The total number of $48n - 166$ elementary gates can be obtained by
summing the costs of the gates in Fig. 14.
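
As a sanity check of Lemma 3.6, the four gates can be simulated for n = 7.
This is a sketch under assumptions: the helper names are hypothetical, and the
fourth gate is taken to have its controls on bits 0 to n-3 and its target on
bit n-2, which is the reading consistent with the four minterm exchanges
listed at the end of the proof.

```python
def apply_gate(x, controls, target):
    """Generalized Toffoli gate: flip `target` iff all control bits are 1."""
    if all((x >> c) & 1 for c in controls):
        return x ^ (1 << target)
    return x


def kappa0_5(x, n):
    """Apply the four gates of the kappa_0(5) circuit (bit 0 is the LSB)."""
    low = list(range(0, n - 2))          # bits 0, ..., n-3
    gates = [
        ([n - 1, n - 2], n - 3),         # Toffoli T(n-1, n-2, n-3)
        (low, n - 1),                    # C^(n-2)NOT with target n-1
        ([n - 1, n - 2], n - 3),         # Toffoli T(n-1, n-2, n-3)
        (low, n - 2),                    # C^(n-2)NOT with target n-2 (assumed)
    ]
    for controls, target in gates:
        x = apply_gate(x, controls, target)
    return x


n = 7
moved = {x: kappa0_5(x, n) for x in range(2 ** n) if kappa0_5(x, n) != x}

# Walk the cycle starting from 2^(n-2) - 1.
start = 2 ** (n - 2) - 1
cycle, nxt = [start], moved[start]
while nxt != start:
    cycle.append(nxt)
    nxt = moved[nxt]
print(cycle)  # [31, 127, 95, 63, 111]
```

For n = 7 exactly five minterms are moved, and they form the cycle
(31, 127, 95, 63, 111), i.e.,
$(2^{n-2} - 1, 2^n - 1, 2^n - 2^{n-2} - 1, 2^{n-1} - 1, 2^n - 2^{n-3} - 1)$.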



Theorem 3.6 ($Syn_5$ method): An arbitrary 5-cycle (a, b, c, d, e) can be
implemented by at most $60n - 130$ elementary gates.

Proof Use at most $5n + 12$ elementary gates to convert a to $2^{n-3}$ (i.e.,
0010...0), d to $2^{n-2}$ (i.e., 010...0), c to $2^{n-1}$ (i.e., 10...0), e to
$2^{n-4}$ (i.e., 00010...0) and b to $2^{n-1} + 2^{n-2} + 2^{n-3} + 1$ (i.e.,
1110...01), sequentially. Note that a and d can be transformed to $2^{n-3}$
and $2^{n-2}$ by only CNOT gates, respectively. For each of the other terms,
at most one Toffoli and n-1 CNOT gates should be used. Then, apply the circuit
shown in Fig. 15. After using the applied gates (at most $6n + 18$ elementary
gates), the terms a, b, c, d, and e are changed to $2^{n-2} - 1$, $2^n - 1$,
$2^n - 2^{n-2} - 1$, $2^{n-1} - 1$, and $2^n - 2^{n-3} - 1$, respectively
($\pi_5$). Therefore, by applying the $\kappa_0(5)$ circuit and the
$\pi_5^{-1}$ circuit, the 5-cycle (a, b, c, d, e) is implemented by at most
$60n - 130$ elementary gates.

Figure 16: The $\kappa_0(5,5)$ circuit, $k = \lceil (n-1)/2 \rceil$

Figure 17: The circuit of Theorem 3.7



Lemma 3.7 The $\kappa_0(5,5)$ circuit (Fig. 16) implements the pair of 5-cycles
$(2^{n-2} - 2,\ 2^n - 2,\ 2^n - 2^{n-2} - 2,\ 2^{n-1} - 2,\ 2^n - 2^{n-3} - 2)$
$(2^{n-2} - 1,\ 2^n - 1,\ 2^n - 2^{n-2} - 1,\ 2^{n-1} - 1,\ 2^n - 2^{n-3} - 1)$
by cost $36n - 206$.

Proof It can be verified that the $\kappa_0(5,5)$ circuit shown in Fig. 16-a
differs from the $\kappa_0(5)$ circuit in its least significant bit (i.e., the
0th bit), which results in two 5-cycles. Applying Lemma 7.3 of [10] on each
gate shown in Fig. 16-a and canceling the resulting redundant gates transforms
Fig. 16-a to Fig. 16-b. The total number of $36n - 206$ elementary gates can
be obtained by summing the costs of the gates shown in Fig. 16-b.



Theorem 3.7 (Syn5,5 method): An arbitrary pair of 5-cycles (a, b, c, d, e) (f, g, h, i,
j) can be implemented by at most $64n - 54$ elementary gates.

Table 1: Maximum cost comparison for the proposed BBs

| BB | Length | Our approach: $\kappa_0$ | Our approach: $\pi$, $\pi^{-1}$ | Our approach: Total | Cost/Length | [13]: Total |
|---|---|---|---|---|---|---|
| (2,2) | 4 | 24n-88 | 5n+12 | 34n-64 | 8.5n-16 | 34n-64 |
| (3) | 3 | 24n-88 | 4n+3 | 32n-82 | 10.7n-27.3 | 68n-128 |
| (3,3) | 6 | 24n-112 | 7n+33 | 38n-46 | 6.3n-15.3 | 68n-128 |
| (2,4) | 6 | 36n-180 | 7n+29 | 50n-122 | 8.3n-20.3 | 68n-128 |
| (4,4) | 8 | 36n-228 | 10n+51 | 56n-126 | 7n-15.7 | 102n-192 |
| (5) | 5 | 48n-166 | 6n+18 | 60n-130 | 12n-26 | 102n-192 |
| (5,5) | 10 | 36n-206 | 14n+76 | 64n-54 | 6.4n-5.4 | 136n-256 |



Proof Apply at most $13n + 47$ elementary gates to convert a to 2 (i.e., $\cdots 010$),
d to 4 (i.e., $\cdots 0100$), e to $2^{n-4}$ (i.e., $00010 \cdots 0$), f to $2^{n-3}$ (i.e., $0010 \cdots 0$), h
to $2^{n-1}$ (i.e., $10 \cdots 0$), i to $2^{n-2}$ (i.e., $010 \cdots 0$), j to 1 (i.e., $\cdots 01$), b to $2^{n-1} + 4$
(i.e., $10 \cdots 0100$), c to $2^{n-1} + 2$ (i.e., $10 \cdots 010$) and g to $2^{n-1} + 2^{n-2} + 2^{n-3}$
(i.e., $1110 \cdots 0$), sequentially. Note that a and d can be transformed to 2 and 4
by only CNOT gates, respectively. In addition, for each of the other terms e, f, h,
i, and j, at most one Toffoli gate and $n - 1$ CNOT gates should be used. For the
last three terms b, c, and g, at most two Toffoli gates should be used to set the
control bits to 1. Then, at most $n - 2$ Toffoli gates should be applied for each
term. For n > 10, the terms b, c, and g can also be implemented by at most one
Toffoli gate and $n - 1$ CNOT gates, which leads to $10n + 32$ elementary gates.
Now, apply the circuit shown in Fig. 17. By using at most $14n + 76$ elementary
gates, the terms a, b, c, d, e, f, g, h, i, and j are changed to $2^{n-2} - 2$, $2^n - 2$,
$2^n - 2^{n-2} - 2$, $2^{n-1} - 2$, $2^n - 2^{n-3} - 2$, $2^{n-2} - 1$, $2^n - 1$, $2^n - 2^{n-2} - 1$, $2^{n-1} - 1$,
and $2^n - 2^{n-3} - 1$, respectively (the $\pi_{5,5}$ circuit). Then, apply the $\kappa_0(5,5)$ and
the $\pi_{5,5}^{-1}$ circuits to implement the cycles (a, b, c, d, e) (f, g, h, i, j) by at most
$64n - 54$ elementary gates.

So far, direct implementations of the selected building blocks have been
studied. Table 1 summarizes the results achieved for these direct
implementations. The table compares the maximum number of
elementary gates of our direct synthesis method with that of the 2-cycle-based method
[13] for the set of proposed building blocks. As the table shows,
the direct k-cycle-based implementation has significant potential
to reduce the cost. However, as the direct implementation of a general k-cycle
could be very hard, this paper also proposes a decomposition algorithm to
be used in conjunction with the selected set of building blocks.
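
The totals in Table 1 follow the structure used in the proofs above: a conjugating circuit, the fixed core circuit, and the inverse of the conjugating circuit. The following minimal Python sketch re-derives the "Total" column under the assumption that the inverse circuit costs the same as the conjugating circuit; the dictionary layout and function names are illustrative and not part of the original work.

```python
# A linear cost a*n + b is stored as the pair (a, b).
# "core" and "pi" values are copied from Table 1; "total" is the table's Total column.
TABLE_1 = {
    "(2,2)": {"core": (24, -88),  "pi": (5, 12),  "total": (34, -64)},
    "(3)":   {"core": (24, -88),  "pi": (4, 3),   "total": (32, -82)},
    "(3,3)": {"core": (24, -112), "pi": (7, 33),  "total": (38, -46)},
    "(2,4)": {"core": (36, -180), "pi": (7, 29),  "total": (50, -122)},
    "(4,4)": {"core": (36, -228), "pi": (10, 51), "total": (56, -126)},
    "(5)":   {"core": (48, -166), "pi": (6, 18),  "total": (60, -130)},
    "(5,5)": {"core": (36, -206), "pi": (14, 76), "total": (64, -54)},
}


def conjugated_cost(core, pi):
    """Cost of applying pi, the core circuit, and pi^-1 in sequence.

    Assumption: pi^-1 has the same gate count as pi.
    """
    return (core[0] + 2 * pi[0], core[1] + 2 * pi[1])


def evaluate(cost, n):
    """Evaluate the linear cost a*n + b for a concrete number of lines n."""
    a, b = cost
    return a * n + b
```

For instance, `evaluate(TABLE_1["(5,5)"]["total"], 10)` gives the bound for a pair of 5-cycles on 10 lines.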

3.2 Decomposition Method 

In the rest of this paper, 2-, 3-, 4- and 5-cycles are called elementary cycles. For
an arbitrary single permutation P, we would like to decompose it into a set of
elementary cycles $c_1, c_2, \cdots, c_k$ such that applying P is identical to
applying $c_1, c_2, \cdots, c_k$ sequentially, and $c_1, c_2, \cdots, c_k$ as well as P belong
to a single permutation group.

To describe the decomposition method, the following notations are used: P
as an input permutation, m as the maximum cycle length available in P, $C_k$ as
a cycle of length k, $C_{k,i(k)}$ as the set of $i(k)$ cycles each of which is of length k,
$C_k^j$ ($1 \le j \le i(k)$) as the $j$-th cycle of the cycle set $C_{k,i(k)}$, $N(k)$ as the number of
disjoint 5-cycles in a given k-cycle, $L(k)$ as the length of a given k-cycle after
detaching $N(k)$ disjoint 5-cycles, and $E(k)$ as the length of a given k-cycle after
detaching all of the available disjoint/non-disjoint 5-cycles in the given k-cycle.

Any permutation P can be written uniquely, except for the order, as a
product of disjoint cycles. Without loss of generality, we assume that
$P = C_{m,i(m)} C_{m-1,i(m-1)} \cdots C_{3,i(3)} C_{2,i(2)}$ where $\forall k \in (2, \cdots, m)$: $i(k) \ge 0$. For
each $C_k$ ($k > 5$) in P, $C_{k,i(k)}$ is decomposed into a set of cycles of lengths
5, 4, 3, and 2, sequentially. In addition, for any two cycles $C_{k,i(k)}$ and $C_{j,i(j)}$
($k > j$), $C_{k,i(k)}$ is processed first. Consider a given k-cycle (1, 2, 3, 4, $\cdots$, k)
($k > 5$). It is possible to decompose it into two cycles (1, 2, 3, 4, 5) (6, 7, $\cdots$, k,
1) of length 5 and $k - 4$, respectively. Repeating the process leads to
$N(k) = \lfloor k/5 \rfloor$ disjoint 5-cycles and a cycle of length $L(k) = N(k) + (k \bmod 5)$ with some
non-disjoint members. This process is called the 5-cycle extraction method in
the rest of the paper.
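
The extraction step can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: cycles are tuples in which each element maps to the next, a list of cycles is applied in order (first cycle first), and the function names are hypothetical.

```python
def decompose_cycle(cycle):
    """Split one cycle into elementary cycles (length <= 5) by repeated
    5-cycle extraction: (1, ..., k) -> (1, 2, 3, 4, 5) (6, ..., k, 1)."""
    parts = []
    rest = list(cycle)
    while len(rest) > 5:
        head = rest[:5]
        parts.append(tuple(head))
        # The remaining cycle keeps the first element of the detached 5-cycle.
        rest = rest[5:] + [head[0]]
    if len(rest) >= 2:  # a 1-cycle is the identity and can be dropped
        parts.append(tuple(rest))
    return parts


def apply_cycle(cycle, x):
    """Image of x under a single cycle."""
    if x in cycle:
        return cycle[(cycle.index(x) + 1) % len(cycle)]
    return x


def apply_sequence(cycles, x):
    """Image of x when the cycles are applied one after another."""
    for c in cycles:
        x = apply_cycle(c, x)
    return x
```

Applying the extracted cycles in order reproduces the original cycle, which is the property the decomposition relies on.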

Since $C_{k,i(k)}$, $\forall k \in (2, \cdots, m)$, contains $i(k)$ cycles of length k, one can write
$C_{k,i(k)} = C_k^1 C_k^2 \cdots C_k^{i(k)}$. For each $C_k$ and by using the 5-cycle extraction
method, $C_k = C_{5,1} C_{k-4,1} = C_{5,2} C_{k-8,1} = \cdots = C_{5,N(k)} C_{L(k),1}$. Repeating this process
for $L(k)$, $L(L(k))$, etc. leads to $C_k = C_{5,N(k)} C_{5,N(L(k))} C_{5,N(L(L(k)))} \cdots
C_{5,N(L(L\cdots(k)))} C_{E(k),1}$. Note that $E(k)$ is smaller than 5. Since there are $i(k)$
cycles of length k, $C_{k,i(k)} = C_{5,N(k) \times i(k)} C_{5,N(L(k)) \times i(k)} C_{5,N(L(L(k))) \times i(k)} \cdots
C_{5,N(L(L\cdots(k))) \times i(k)} C_{E(k),i(k)}$.

It can be verified that the elementary cycle resulting from a k-cycle (k > 5) has
no common members with the other cycles. In addition, all disjoint/non-disjoint
5-cycles (detached from a k-cycle) are disjoint from the other cycles. Therefore, the
input permutation P can be written as (1). See Example 3.2 for more details.



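$$P = \left(C_{5,N(k) \times i(k)}\right)\big|_{k \ge 5} \left(C_{5,N(L(k)) \times i(k)}\right)\big|_{k \ge 5} \cdots \left(C_{5,N(L(L(\cdots(k)))) \times i(k)}\right)\big|_{k \ge 5} \; C_{4,i'(4)} \, C_{3,i'(3)} \, C_{2,i'(2)} \qquad (1)$$

where

$$i'(4) = i(4) + \sum_{k=5}^{m} i(k) \, \mathbf{1}_{E(k)==4}, \qquad i'(3) = i(3) + \sum_{k=5}^{m} i(k) \, \mathbf{1}_{E(k)==3}, \qquad i'(2) = i(2) + \sum_{k=5}^{m} i(k) \, \mathbf{1}_{E(k)==2}$$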



Example 3.2 Consider P = (3, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19,
20, 21) (22, 23, 24, 25, 26, 27) (28, 29) (30, 31) written as $C_{16,i(16)} C_{6,i(6)}
C_{2,i(2)}$. It can be verified that m = 16, i(16) = 1, i(6) = 1, i(2) = 2, and
i(k) = 0 for $k \in (3, 4, 5, 7, 8, \cdots, 15)$. We have:

• k = 16

  - $N(16) = \lfloor 16/5 \rfloor = 3$, $L(16) = 3 + 1 = 4$

  - $N(L(16)) = N(4) = 0$, $L(L(16)) = 4 = E(16)$

• k = 6

  - $N(6) = 1$, $L(6) = 2 = E(6)$

• k = 2

  - $N(2) = 0$, $L(2) = 2 = E(2)$

Therefore P = ($C_{5,3} C_{5,1}$) ($C_{4,1} C_{2,3}$) = (3, 5, 6, 7, 9) (10, 11, 12, 13, 14) (15, 17,
18, 19, 20) (22, 23, 24, 25, 26) (21, 3, 10, 15) (22, 27) (28, 29) (30, 31).
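
The quantities N(k), L(k) and E(k) used in this example can be computed directly from the cycle lengths. The sketch below is illustrative only: the function names mirror the notation of this section, and `cycle_type` is a hypothetical dictionary mapping each cycle length k to its count i(k).

```python
def N(k):
    """Number of disjoint 5-cycles detached from a k-cycle."""
    return k // 5


def L(k):
    """Length of the cycle left after detaching N(k) disjoint 5-cycles."""
    return N(k) + k % 5


def E(k):
    """Length left after detaching all disjoint and non-disjoint 5-cycles."""
    while k >= 5:
        k = L(k)
    return k


def num_5_cycles(k):
    """N(k) + N(L(k)) + N(L(L(k))) + ... for one k-cycle."""
    total = 0
    while k >= 5:
        total += N(k)
        k = L(k)
    return total


def elementary_counts(cycle_type):
    """Count elementary cycles of lengths 5, 4, 3, 2 after decomposition.

    cycle_type maps a cycle length k to the number i(k) of such cycles.
    """
    counts = {5: 0, 4: 0, 3: 0, 2: 0}
    for k, i_k in cycle_type.items():
        counts[5] += num_5_cycles(k) * i_k
        leftover = E(k)
        if leftover >= 2:  # a leftover of length 1 is the identity
            counts[leftover] += i_k
    return counts
```

For the permutation of Example 3.2, `elementary_counts({16: 1, 6: 1, 2: 2})` yields four 5-cycles, one 4-cycle and three 2-cycles, matching the decomposition listed above.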

Considering the 5-cycle extraction method, the extraction time complexity
of each k-cycle can be written as $O(k) + O(L(k)) + O(L(L(k))) + \cdots + O(E(k))
< O(\eta k)$ where $\eta$ is an integer smaller than k. Therefore, each given k-cycle
is processed with the time complexity of $O(k)$. On the other hand, as there
are $i(k) \ge 0$ cycles of length k, the total time complexity of the decomposition
method is $O(m) \times i(m) + O(m-1) \times i(m-1) + \cdots + O(2) \times i(2)$ where
$O(i(k)) = O(2^n/k)$, $k \le 2^n$ for $k \in (2 \cdots m)$. Therefore, we have $O(m) \times i(m)
+ O(m-1) \times i(m-1) + \cdots + O(2) \times i(2) = O(m^2) = O(2^{2n})$ as $m \le 2^n$. It is
important to note that the decomposition algorithm of [13] works with the same
$O(2^{2n})$ time complexity. After the decomposition stage, the resulting elementary
cycles should be implemented by using the proposed synthesis algorithms. Note
that the total number of 5-cycles extracted from a k-cycle is $O(k) + O(L(k)) + O(L(L(k))) + \cdots$,
which is equal to $O(k)$. Considering all k-cycles (k > 5), the total number of
5-cycles is $O(2^{2n})$ as explained above. In addition, as each k-cycle (k > 5)
could produce at most one elementary cycle with length 2, 3 or 4, the total
number of such elementary cycles is at most $\sum_{k=2}^{m} i(k)$. Therefore, the
total number of elementary cycles is $O(2^{2n})$, which leads to the time complexity of
$O(2^{2n}) \times O(\text{Synthesis Algorithm})$. It can be verified that the proposed synthesis
algorithms for the elementary cycles are of $O(n)$. As a result, the total time
complexity of the proposed approach is $O(2^{2n} \times n)$, the same as [13].

To count the maximum number of elementary cycles in the proposed method,
note that the numbers of 5-cycle pairs, 3-cycle pairs and 4-cycle pairs resulting
from the decomposition algorithm are $Num_{5,5} = \lfloor \frac{1}{2} \sum_{k=5 \cdots m} (N(k) + N(L(k)) + \cdots) \rfloor$,
$Num_{3,3} = \lfloor \frac{1}{2} i'(3) \rfloor$, and $Num_{4,4} = \lfloor \frac{1}{2} i'(4) \rfloor$, respectively. On the other hand, at
most one single 5-cycle, one single 3-cycle and one 4-cycle followed by a 2-cycle
are produced, i.e., $Num_5 = \mathrm{mod}(\sum_{k=5 \cdots m} (N(k) + N(L(k)) + \cdots), 2)$,
$Num_3 = \mathrm{mod}(i'(3), 2)$ and $Num_{4,2} = \mathrm{mod}(i'(4), 2)$. Finally, the number of 2-cycle
pairs is $Num_{2,2} = \lfloor \frac{1}{2} (i'(2) - Num_{4,2}) \rfloor$. Altogether, the maximum number of
elementary gates resulting from the proposed k-cycle-based synthesis method can
be expressed by (2). See the following examples for more details.

$$Num_{5,5} \times (64n - 54) + Num_5 \times (60n - 130) + Num_{3,3} \times (38n - 46) + Num_3 \times (32n - 82) + Num_{4,4} \times (56n - 126) + Num_{4,2} \times (50n - 122) + Num_{2,2} \times (34n - 64) \qquad (2)$$

Step 1
    Fix the 0 and $2^i$ terms using a pre-process stage as done in [5].
Step 2
    if n < 7
        1- Decompose the input permutation into a set of 2-cycles
        2- Apply Syn2,2 to synthesize all 2-cycles
    else
        1- Decompose the input permutation into a set of 5, 4, 3, and 2 cycles
        2- Synthesize all disjoint 5-cycle pairs (Syn5,5)
        3- Synthesize single 5-cycles (Syn5)
        4- Synthesize all disjoint 3-cycle pairs (Syn3,3)
        5- Synthesize single 3-cycles (Syn3)
        6- Synthesize all disjoint 4-cycle pairs (Syn4,4)
        7- Synthesize all disjoint 4-cycle and 2-cycle pairs (Syn4,2)
        8- Synthesize all disjoint 2-cycle pairs (Syn2,2)

Figure 18: The k-cycle-based synthesis method



Example 3.3 Again, reconsider the permutation of Example 3.2. P = $C_{16,1}
C_{6,1} C_{2,2}$ where $Num_{5,5} = \lfloor \frac{1}{2}(N(16) + N(6)) \rfloor = 2$, $Num_5 = 0$, $Num_{3,3} = 0$,
$Num_3 = 0$, $Num_{4,4} = 0$, $Num_{4,2} = 1$, and $Num_{2,2} = \lfloor \frac{1}{2}(3 - 1) \rfloor = 1$. At most
$2 \times (64n - 54) + (50n - 122) + (34n - 64) = 212n - 294$ elementary gates are
produced using our k-cycle-based synthesis method.
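
The bound of Example 3.3 can be reproduced mechanically from the numbers of elementary cycles. The Python sketch below is a hedged illustration: the per-block costs are the coefficients of Eq. (2), while the function names, the use of floor division for the 2-cycle pairs (chosen to match this example) and the clamp at zero are assumptions of the sketch.

```python
# Per-block cost a*n + b stored as (a, b), taken from Eq. (2).
COSTS = {
    "5,5": (64, -54), "5": (60, -130),
    "3,3": (38, -46), "3": (32, -82),
    "4,4": (56, -126), "4,2": (50, -122),
    "2,2": (34, -64),
}


def pair_counts(n5, n4, n3, n2):
    """Group elementary cycles into building blocks.

    n5, n4, n3, n2 are the numbers of 5-, 4-, 3- and 2-cycles after
    decomposition (n4, n3, n2 correspond to i'(4), i'(3), i'(2)).
    """
    num = {
        "5,5": n5 // 2, "5": n5 % 2,
        "3,3": n3 // 2, "3": n3 % 2,
        "4,4": n4 // 2, "4,2": n4 % 2,
    }
    # One 2-cycle is consumed by the (4,2) block if there is an odd 4-cycle.
    # Assumption: floor division, clamped at zero.
    num["2,2"] = max(0, (n2 - num["4,2"]) // 2)
    return num


def max_gate_bound(num):
    """Return (a, b) such that the bound of Eq. (2) equals a*n + b."""
    a = sum(num[key] * COSTS[key][0] for key in COSTS)
    b = sum(num[key] * COSTS[key][1] for key in COSTS)
    return a, b
```

With four 5-cycles, one 4-cycle, no 3-cycles and three 2-cycles, `max_gate_bound(pair_counts(4, 1, 0, 3))` returns the coefficients of $212n - 294$.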

Example 3.4 Let P = (3, 5, 6, 7, 9, 10, 11, 12, 13, 14) (15, 17, 18, 19, 20, 21)
(22, 23, 24) (25, 26, 27) (28, 29, 30) (P = $C_{10,1} C_{6,1} C_{3,3}$). After decomposition,
we have P = $C_{5,3} C_{3,3} C_{2,2}$. After applying the proposed method, $Num_{5,5} = 1$,
$Num_5 = 1$, $Num_{3,3} = 1$, $Num_3 = 1$, $Num_{4,4} = 0$, $Num_{4,2} = 0$, $Num_{2,2} = 1$,
and at most $(64n - 54) + (60n - 130) + (38n - 46) + (32n - 82) +
(34n - 64) = 228n - 376$ elementary gates are produced.

As stated at the beginning of Section 3, the zero and $2^i$ terms are fixed by
applying a few Toffoli and CNOT gates as done in [9]. In addition, for small n
(i.e., n < 7), the decomposition algorithm is modified to produce only 2-cycles
where each cycle pair is synthesized by the Syn2,2 method. The complete
k-cycle-based synthesis method is shown in Fig. 18. For n ≥ 7, a given permutation
is recursively decomposed into a set of elementary cycles, each of which is
synthesized by the synthesis algorithm listed in parentheses as discussed.

Theorem 3.8 The proposed k-cycle-based synthesis method always converges.

Proof According to the proofs of Theorem 3.1 to Theorem 3.7, the suggested
building blocks (i.e., a pair of 2-cycles, a single 3-cycle, a pair of 3-cycles, a single
5-cycle, a pair of 5-cycles, a single 2-cycle (4-cycle) followed by a single 4-cycle
(2-cycle), and a pair of 4-cycles) can always be synthesized for any arbitrary
values of cycle elements for n ≥ 7 as long as each cycle element is neither 0 nor
$2^i$. In addition, by using the proposed decomposition algorithm, a given large
cycle can always be decomposed into a set of elementary cycles. For small n (i.e.,
n < 7), the decomposition algorithm produces only 2-cycles where each pair can
always be synthesized by the Syn2,2 method. Considering the pre-process stage
for the zero and $2^i$ terms and the synthesis scenarios for n < 7 and n ≥ 7 as
explained above leads to the theorem.

3.3 Worst Case Analysis 

To analyze the total number of elementary gates resulting from the proposed
k-cycle-based synthesis method in the worst case, assume that the maximum of
m members ($a_1, a_2, \cdots, a_m$) of a given permutation P are moved. As each $a_k$,
$k \in (2, \cdots, m)$, is neither 0 nor $2^i$, m is equal to $2^n - n - 1$ for an even n and
equal to $2^n - n - 2$ for an odd n.

Theorem 3.9 The maximum number of elementary gates in the proposed cycle-based
synthesis method is $8.5n2^n + o(2^n)$.

Proof In order to place each row at its right position, several reversible gates
should be applied in the proposed method. The worst-case cost occurs for the
maximum number of changed rows (i.e., $m \sim O(2^n)$). The synthesis costs listed
in Table 1 (i.e., Cost/Length) indicate that the cost of correcting a single row is
$8.5n - 16$ for a pair of 2-cycles, $10.7n - 27.3$ for a single 3-cycle, $6.3n - 15.3$ for
a pair of 3-cycles, $8.3n - 20.3$ for a single 2-cycle followed by a single 4-cycle,
$7n - 15.7$ for a pair of 4-cycles, $12n - 26$ for a single 5-cycle and $6.4n - 5.4$ for
a pair of 5-cycles.

For a decomposition with $2^n$ changed rows, there are at most one single
5-cycle and one single 3-cycle. Considering the cost of $12n - 26$ for correcting a
single row in a single 5-cycle, $10.7n - 27.3$ in a single 3-cycle and $8.5n - 16$ in a
pair of 2-cycles, it can be verified that the worst-case cost for a decomposition
with $2^n$ changed rows is $8.5n2^n + o(2^n)$.

Theorem 3.9 shows a lower upper bound for the k-cycle-based synthesis method
compared to the best reported upper bound of $11n2^n + o(n2^n)$ for the synthesis
algorithm proposed in [6]. Given the fact that the $8.5n2^n$ term is dominant over
the $o(2^n)$ term, the former will be used in the remainder of this subsection for
cost analysis.

Reversible logic has applications in quantum computing [10], [5]. Most quantum
algorithms presume that interaction between arbitrary qubits is possible
with no extra cost. However, some restrictions exist in real quantum
technologies (fSj. For example, in a Linear Nearest Neighbor (LNN) architecture,
only adjacent qubits may interact. The implementation complexity with limited
interaction depends on the relative target and control positions. It can
be modeled by using a sequence of SWAP gates to move controls and targets
close to each other to construct appropriate gates. Theorem 3.10 examines the
proposed method for the LNN architecture.

Theorem 3.10 The maximum number of elementary gates in the proposed
k-cycle-based synthesis method for the LNN architecture is equal to $51n^2 2^n$.

Proof To prove the theorem, the number of SWAP operations required to perform a
2-qubit gate g with control c and target t has to be found. We assume c > t.
It can be verified that $(c - t - 1)$ SWAP operations are required to bring the
control adjacent to the target, one gate is required to perform g, and the same
sequence of $(c - t - 1)$ SWAP operations is required to return the value of the
$i$-th ($t < i < c$) qubit to its initial value. Considering a cost of 3 for each SWAP
operation leads to $6 \times (c - t - 1) + 1$. The case of c < t can be readily deduced
by following the same approach.

Since $c - t - 1 < n$, each gate costs fewer than $6n$ elementary gates. The theorem
can then be proven by using Theorem 3.9 and plugging in this cost, i.e.,
$8.5n2^n \times 6n = 51n^2 2^n$.
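
The per-gate LNN overhead used in this proof is easy to tabulate. The following Python sketch is illustrative: the function names are hypothetical, qubit positions are integer line indices, and the final bound simply multiplies the dominant term of Theorem 3.9 by the per-gate factor $6n$ as in the proof.

```python
SWAP_COST = 3  # each SWAP is counted as 3 elementary gates, as in the proof


def lnn_gate_cost(c, t):
    """Elementary-gate cost of one 2-qubit gate with control c and target t
    on a linear nearest-neighbor chain: swap in, apply the gate, swap back."""
    distance = abs(c - t)
    if distance == 0:
        raise ValueError("control and target must be different lines")
    swaps_one_way = distance - 1
    return 2 * SWAP_COST * swaps_one_way + 1


def lnn_worst_case_bound(n):
    """Leading term 51 * n^2 * 2^n of Theorem 3.10."""
    return 51 * n * n * 2 ** n
```

For adjacent lines the cost stays at one gate, while a control and target three lines apart cost 13 elementary gates.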

4 Experimental Results 

The proposed k-cycle-based synthesis method and the 2-cycle-based algorithm
presented in [13] were implemented in C++ and all of the experiments were
done on an Intel Pentium IV 2.2GHz computer with 2GB memory. In addition,
we used one of the most recent and efficient NCT-based synthesis tools proposed
in [6] for our comparisons. This method used Reed-Muller (RM) spectra in an
iterative synthesis procedure (RM-based method). In all experiments, the
post-processing algorithm proposed in [13] was applied to simplify circuits produced
by our synthesis method and the algorithm of [13]. In this method, optimal
circuits for all 40320 3-input reversible functions and a large set of 4-input 
circuits were generated and stored in a compact data-structure. As a result, 
applying the post-processing algorithm of [13] leads to optimal results for all 
3- and some 4-input specifications. The synthesis algorithm of [Hj was applied 
in "synthesized/resynthesized using 3 methods" mode for circuits with n < 15
and in "synth/resynth with MMD (15+ variables)" for n ≥ 15. In addition, the
synthesis algorithm, the template matching method, the random and exhaustive 
driver algorithms were applied sequentially to synthesize each function with a 
time limit of 12 hours as done in [6]. Bidirectional and quantum cost reduction
modes were also applied. 

To evaluate the proposed synthesis method, the completely specified reversible
benchmarks from [16] were examined. In addition, the best documented
synthesis costs available at [16], resulting from applying different NCT-based
synthesis tools, were used for our comparisons. In some cases, the synthesis results
in the NCT library for some benchmarks have not been reported yet (these functions
are N-th prime functions over more than 7 bits, hamming coding functions
(hwb) over more than 11 bits¹ and permanent functions). In those cases, we
applied the synthesis method of jj] which works efficiently in terms of quan- 
tum cost with a time limit of 12 hours. If it failed to synthesize a function in 
the given time limit (for hwb functions over more than 11 bits and N-th prime 
functions over more than 10 bits, the algorithm failed), the method of [13] was 
applied. All synthesis algorithms were compared in terms of the quantum cost 
as done in [16]. Our actual circuits are available from [18]. 

The results of the proposed k-cycle-based synthesis method (Pure k-CYCLE)
and the best synthesized circuits resulting from the previous NCT-based synthesis
algorithms (Best Results) are shown in Table 2. A comparison of the
synthesis costs of the proposed k-cycle-based method and the best reported
ones reveals that the cycle-based approach behaves differently in terms of the
quantum cost for different benchmarks (for examples, see the results of hwb11
and cycle10_2). In the rest of this section, by analyzing the characteristics of
different benchmarks, a hybrid synthesis framework is proposed which uses the
cycle-based method in conjunction with the method of [6] to synthesize a given
function. As shown later, the proposed hybrid framework can improve the
average quantum costs efficiently.

To evaluate the behavior of the k-cycle-based synthesis method, a Distance metric
is defined as (3) for each reversible function f where $0 \le Distance(f) \le 1$.

$$Distance(f) = \sum_{i=0}^{2^n - 1} |f(i) - i| \, / \, 2^{2n-1} \qquad (3)$$

For a given function f, Distance(f) models the distribution of output code
words compared with the identity function. Fig. 19 shows the distributions of
output code words for three benchmarks. As illustrated in this figure, ham7
(Distance(f) = 0.38) and cycle10_2 (Distance(f) = 0.001) are more similar
to the identity function (f(i) = i, Distance(f) = 0) compared with hwb10
(Distance(f) = 0.62). The distributions of output code words for other
functions are reported in Table 2 (i.e., Dist.).
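
As a rough illustration, the Distance metric can be computed from a truth table as follows. This sketch assumes the normalization $2^{2n-1}$ read from Eq. (3) and represents a reversible function as a Python list `f` of length $2^n$ with `f[i]` the output for input `i`; the function name is hypothetical.

```python
def distance(f, n):
    """Distance(f) = sum_i |f(i) - i| / 2^(2n - 1) for an n-line function.

    f is a permutation of range(2**n) given as a list.
    """
    size = 2 ** n
    if sorted(f) != list(range(size)):
        raise ValueError("f must be a permutation of 0 .. 2^n - 1")
    return sum(abs(f[i] - i) for i in range(size)) / 2 ** (2 * n - 1)
```

The identity function gives 0, and the bit-wise complement $f(i) = 2^n - 1 - i$ reaches the maximum value 1 under this normalization.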

Based on the characterization of a reversible function, we divided the benchmarks
into three categories as shown in Table 2 (Cat.). Category 1 includes
small functions with fewer than seven inputs. Category 2 and category 3 include
large functions with n ≥ 7 but with different distribution levels. In other
words, for each function in category 2 (3), Distance(f) is greater (less) than
0.5. By applying a hybrid synthesis framework, functions in different categories
are handled differently as shown in Fig. 20.

For functions in category 1, we applied the cycle-based synthesis method
first. Then, the random_driver procedure introduced in [6] was applied. Since
category 1 includes small functions, applying the random_driver method for
optimizing the results has no runtime overhead. Hence, combining different
heuristics (i.e., cycle-based approach and random_driver procedure) to achieve
better cost is reasonable. On the other hand, for large functions in category
2 with considerable differences from the identity function (Distance > 0.5),
only the cycle-based synthesis method was applied. According to [5], for some
functions in this category (i.e., hwb11) the method of [6] needs several hours
to synthesize the function. Similarly, in [19], the authors stated that their synthesis
algorithm cannot synthesize hwb circuits with over five variables by the NCT library
(with 4GB RAM and finite runtime). Memory/runtime limitations will be even
more challenging for hwb functions with more variables. As can be seen in Table
2, both average cost and runtime were improved for functions in category 2.

¹ For hwb functions, polynomial size reversible circuits in the NCTF library (NCT library plus
the Fredkin gate [17]) with $\lceil \log(n) \rceil + 1$ garbage bits and $O(n \log^2(n))$ gates exist [16].

Table 2: The comparison costs of the proposed synthesis framework. Time
* A time limit of 12 hours was considered in applying the method of

Figure 20: The hybrid synthesis framework

On the other hand, for functions in category 3 which have some similarities
to the identity function (Distance < 0.5), the RM-based method is used in the
proposed hybrid framework. A reversible function with large Distance can have
a regular distribution at its output side (e.g., $f(i) = 2^n - 1 - i$ where Distance(f) =
1). Hence, the number of patterns (NoP) in the distribution of output code words
was also used in the proposed hybrid framework. A regular output distribution
leads to a small NoP. Fig. 21 shows output patterns for the ham7 function (NoP =
12). A function with an appropriate number of patterns (NoP < Th) at its
output code words is similar to the identity function to some extent. Hence,
such a function was synthesized by using the RM-based method too. For example,
mod1024adder with Distance = 0.66 and NoP = 1000 was synthesized by
applying the RM-based method. We set Th = $0.005 \times 2^n$ in our experiments.

Figure 21: Output patterns for ham7 function with NoP = 12
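
The selection rule described in the last few paragraphs can be summarized as a small decision function. The sketch below is only one reading of the text: the function name and return labels are hypothetical, NoP is taken as a precomputed input because its computation is not specified here, and the handling of Distance exactly equal to 0.5 is an assumption.

```python
def choose_method(n, dist, nop):
    """Pick the synthesis route of the hybrid framework.

    n    : number of variables of the reversible function
    dist : Distance(f) as defined by Eq. (3)
    nop  : number of patterns (NoP) in the output distribution
    Returns "kC+R" (cycle-based then random_driver), "RM" or "kC".
    """
    if n < 7:
        return "kC+R"            # category 1: small functions
    threshold = 0.005 * 2 ** n   # Th as set in the experiments
    if dist < 0.5 or nop < threshold:
        return "RM"              # category 3, or regular output patterns
    return "kC"                  # category 2 (assumption: includes dist == 0.5)
```

For mod1024adder, assuming it has n = 20 variables (the value is not stated in this section), the threshold is about 5243, so NoP = 1000 selects the RM-based method even though Distance = 0.66.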

The results of the hybrid synthesis framework are shown in Table 2, where the
k-cycle-based, random_driver and RM-based methods are denoted by kC, R,
and RM, respectively. Runtime results (in seconds) for the hybrid framework
are reported in Table 2 too. According to the experimental results, the RM-based
method works very fast for functions in category 3 compared with category
2. Therefore, the proposed hybrid synthesis framework outperforms the best
results in terms of quantum cost and runtime on average. Our synthesis tool
can potentially synthesize functions with any number of variables. However, as
the number of variables and synthesized gates grows, the runtime and
memory usage grow accordingly (for hwb functions with n > 20, peak memory
usage was more than 2GB).

Since both the cycle-based and RM-based [6] methods always result in a synthesized
circuit, the proposed framework always converges. Moreover, as a generic
reversible function f with large n and Distance(f) > 0.5 without regular patterns
at its output side needs many more gates in the proposed hybrid framework
compared with other functions, the worst-case cost of the hybrid framework is
identical to the worst-case cost of the cycle-based method (i.e., $8.5n2^n + o(2^n)$).

5 Conclusion and future directions 

In this paper, a k-cycle-based synthesis method for reversible functions was
proposed and analyzed in detail. To this end, a set of synthesis algorithms
was proposed to synthesize cycles of length less than 6 (i.e., elementary cycles).
In addition, a decomposition algorithm was introduced to decompose a large
cycle into a set of elementary cycles. Next, the decomposition algorithm and
the proposed synthesis algorithms were used to synthesize all permutations. By
evaluating different benchmark functions, the behavior of the cycle-based synthesis
method was analyzed and a hybrid synthesis framework was introduced which
uses the proposed cycle-based synthesis method in conjunction with one of the
recent synthesis methods.

Our worst-case analysis revealed that the proposed hybrid synthesis framework
leads to a lower upper bound compared to the present synthesis algorithms.
The hybrid framework always converges and it leads to better average runtime.
The experiments for average-case costs revealed that the proposed framework
produces circuits with lower costs for benchmark functions.
A natural next step is to work on the synthesis of cycles
with length greater than 5 to improve the average-case cost of the k-cycle-based
synthesis method, which can improve the results of the hybrid framework too.
In addition, a synthesis approach for incompletely specified functions
based on the one proposed here could be considered as future research.